Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not report a match for ambiguous sequences #827

Merged
merged 1 commit into from
Dec 13, 2024
Merged

Do not report a match for ambiguous sequences #827

merged 1 commit into from
Dec 13, 2024

Conversation

marcelm
Copy link
Owner

@marcelm marcelm commented Dec 11, 2024

This is a backwards-incompatible change, and therefore the version number will be bumped to 5.0.

This only applies when an index is used.

At index creation time, ambiguous sequences (those that would match two or more adapters equally well), are now removed from the index. That is, when they are encountered in a read, they will not be assigned to any adapter.

The intention is to reduce bias because we previously would report a match to one of the adapters (but which one was more or less randomly determined at index creation time).

@marcelm
Copy link
Owner Author

marcelm commented Dec 11, 2024

I’ll let this sit here for a while in case anyone wants to object.

I initially got the request to enable this behavior using a command-line switch, but ignoring ambiguous matches is IMO the only sensible thing, so I want to make it the default.

@rhpvorderman
Copy link
Collaborator

A jump to 5.0 would also be a nice moment to look at all the CLI flags and streamline a few things there. From the top of my head, there is a --max-average-error-rate and a --minimum-length so there is some inconsistency. I added --max-average-error-rate at some point where it sould have been --maximum-average-error-rate obviously.

@marcelm
Copy link
Owner Author

marcelm commented Dec 13, 2024

Ah, I never noticed the inconsistency. I think I can live with that particular one.

I want Cutadapt’s CLI to be really stable, even on major version bumps. So there would at least have to be aliases such that old command lines continue to work, but minor inconsistencies aren’t worth it IMO. What’s bothering me more is that the syntax for specifying different lengths for R1 and R2 is -m LEN1:LEN2, whereas everything else that’s paired-end related uses separate options (lowercase for R1, uppercase for R2).

And I’d actually lean towards shorter --max-length and --min-length aliases instead of the very long --maximum-average-error-rate.

This only applies when an index is used.

At index creation time, ambiguous sequences (those that would match two or
more adapters equally well), are now *removed* from the index. That is, when
they are encountered in a read, they will not be assigned to any adapter.

The intention is to reduce bias because we previously would report a match
to one of the adapters (but which one was more or less randomly determined
at index creation time).
@marcelm marcelm merged commit 647c442 into main Dec 13, 2024
21 checks passed
@marcelm marcelm deleted the ambiguous branch December 13, 2024 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants